A simulation of geographic distribution for the emergence of consequential SARS-CoV-2 variant lineages

The coronavirus disease 2019 (COVID-19) pandemic has been facilitated by the intermittent emergence of consequential variant strains. This study evaluated the geographic disproportionality in the detection of consequential variant lineages across countries. As of November 2021, a total of 40 potentially consequential SARS-CoV-2 variant lineages have been identified. One-hundred repeated simulations that randomly produced consequential variants from overall COVID-19 cases worldwide were performed to evaluate the presence of geographical disproportion in the occurrence of consequential variant outbreaks. Both the total number of reported COVID-19 cases and the number of reported genome sequences in each country showed weak positive correlations with the number of detected consequential lineages in each country. The simulations suggest the presence of geographical disproportion in the occurrence of consequential variant outbreaks. Based on the random occurrence of consequential variants among COVID-19 cases, identified consequential variants occurred more often than expected in the United Kingdom and Africa, whereas they occurred less in other European countries and the Middle East. Simulations of the occurrence of consequential variants by assuming a random occurrence among all COVID-19 cases suggested the presence of biogeographic disproportion. Further studies enrolling unevaluated crucial biogeographical factors are needed to determine the factors underlying the suggested disproportionality.

Coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), remains the world's largest public health concern in 2022 1 . As of April 2022, more than half a billion people have been infected by the virus, resulting in six million deaths worldwide 2,3 . Infection control measures, including the distribution of COVID-19 vaccines, have been implemented worldwide from a relatively early phase of the pandemic 4 , but the pandemic is still ongoing, with the sporadic emergence of consequential variants whose altered transmissibility or severity requires monitoring 5,6 .
SARS-CoV-2 is a single-stranded positive-sense RNA virus with a genome size of approximately 29,900 bases (Wuhan-Hu-1 strain, GenBank Accession ID: NC_045512) 7,8 . The genome of SARS-CoV-2 includes a gene encoding the nsp14 enzyme, which repairs replication errors and thereby supports complex transcriptional and translational tasks with boosted replication fidelity 9,10 . However, coronaviruses have the longest genomes among RNA viruses, and errors during genome replication are common and diverse 11,12 . To date, numerous mutations involving amino acid replacements, insertions, or deletions have been reported in SARS-CoV-2 13,14 . Most mutations in the SARS-CoV-2 genome are known to have no notable positive effect on viral transmissibility or survival 15,16 . As a result, many of these mutations are eventually eliminated from the environment. However, some mutations are consequential and confer selective advantages, allowing them to survive and become predominant in some populations through natural selection or founder effects 17,18 .
As of the end of November 2021, a total of 40 potentially consequential variant lineages (33 variants for which the probable country of origin has been identified and 7 variants for which it has not), classified by the Phylogenetic Assignment of Named Global Outbreak (PANGO), have been detected worldwide and designated as strains worth watching by the World Health Organization (WHO) 19,20 . As the emergence of consequential variants with enhanced transmissibility could trigger a new outbreak in the invaded regions or countries, clarifying the possible mechanisms and factors that may facilitate the emergence of consequential variants is essential for controlling the pandemic 21 . Among the suspected factors that could potentially influence the infection dynamics and occurrence of consequential variant strains, biogeographical factors that may facilitate the evolutionary potential of the virus must be considered 22,23 . Therefore, this study aimed to evaluate the biogeographic disproportionality in the emergence and detection of the potentially consequential variant lineages that required monitoring across different countries and biogeographic regions.

Methods
Study objectives and design. The main objective of the present study was to compare the actual and simulated numbers of consequential SARS-CoV-2 variants first detected in each country or biogeographical region worldwide. A flowchart depicting the objectives and research design of this study is presented in Fig. 1. Furthermore, the relationship between the overall number of COVID-19 cases or shared viral genome sequences in each country and the number of consequential SARS-CoV-2 variants first detected in that country was evaluated.
Simulation of the occurrence of consequential lineages. A total of 100 simulations of the random occurrence of 40 consequential SARS-CoV-2 variants among overall COVID-19 cases in 223 areas and countries were performed, based on the assumption that consequential variant outbreaks occur randomly among COVID-19 cases. More specifically, each country was allocated one or more unique numbers between 1 and 10,000, with the count of numbers per country scaled to the cumulative number of COVID-19 cases reported in that country by November 2021. The range of the allocated numbers (i.e., 1-10,000) was chosen to avoid overestimating or underestimating the number of COVID-19 patients in smaller countries with fewer patients. The overall number of reported COVID-19 cases in each country was based on data reported by the Coronavirus Resource Center at Johns Hopkins University, USA (https://coronavirus.jhu.edu/). Then, a number between 1 and 10,000 was randomly selected using random numbers produced by the Mersenne Twister algorithm 24 . In each simulation run, a total of 40 random numbers between 1 and 10,000 were produced, and the country corresponding to each random value was listed. The simulations were repeated 100 times. The difference in the time-varying reproduction number between countries or regions, which is essential when considering the spatiotemporal spread of infection in each locality, was not considered in the present study because the simulations adopted the null hypothesis that consequential variants occur completely at random among the overall COVID-19 cases in the world without biogeographical disproportion 25,26 .
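The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `allocate_slots` and `simulate`, the toy case counts, the rule that every country receives at least one number, and the largest-remainder rounding are all assumptions, since the text does not specify how the 10,000 numbers were rounded. Python's standard `random` module is used because it is based on the Mersenne Twister.

```python
import random
from collections import Counter


def allocate_slots(cases, total_slots=10_000):
    """Share the numbers 1..total_slots among countries by case count.

    Assumption: every country gets one guaranteed slot and the rest are
    split proportionally with largest-remainder rounding.
    """
    total_cases = sum(cases.values())
    remaining = total_slots - len(cases)
    quotas = {c: n * remaining / total_cases for c, n in cases.items()}
    slots = {c: 1 + int(q) for c, q in quotas.items()}
    leftover = total_slots - sum(slots.values())
    # Hand out the slots lost to rounding, largest fractional part first.
    by_fraction = sorted(quotas, key=lambda c: quotas[c] - int(quotas[c]),
                         reverse=True)
    for country in by_fraction[:leftover]:
        slots[country] += 1
    return slots


def simulate(cases, n_variants=40, n_runs=100, seed=0):
    """Draw n_variants random numbers per run and map each to a country."""
    slots = allocate_slots(cases)
    # lookup[k - 1] is the country that owns the number k.
    lookup = [c for c, k in slots.items() for _ in range(k)]
    rng = random.Random(seed)  # Mersenne Twister generator
    runs = []
    for _ in range(n_runs):
        drawn = [lookup[rng.randint(1, len(lookup)) - 1]
                 for _ in range(n_variants)]
        runs.append(Counter(drawn))
    return runs


# Hypothetical case counts, for illustration only.
toy_cases = {"Country A": 9_000_000, "Country B": 900_000, "Country C": 100}
runs = simulate(toy_cases)
mean_per_run = {c: sum(r[c] for r in runs) / len(runs) for c in toy_cases}
print(mean_per_run)
```

Under this null model, the mean count per run for a country approaches 40 times its share of the allocated numbers, which is the quantity the observed counts are compared against.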
Visual confirmation of the geographic disproportionality. The actual and simulated distributions for the occurrence of consequential SARS-CoV-2 variants were evaluated by plotting them on a world map to visually confirm the disproportionality of the distributions. The frequencies of occurrence between actual and simulated data were compared by country and biogeographic region. The biogeographic regions of the major prevailing countries were categorized in the following alphabetical order: (1)

Results
Detected potentially consequential SARS-CoV-2 variants. By 20,29 . Others are designated as variants under monitoring (VUM) or as de-escalated variants that were dropped from the prioritized watching list. The geographic distribution of the 33 variants with originating countries, which usually correspond to the countries where the variants were first detected, is shown in Fig. 2. Notably, the country that first detected a variant could differ from the country of origin of the variant, as with the P.1 Gamma variant, which was first detected in travelers returning from Brazil to Japan 30,31 .
Across the 223 areas and countries, the number of consequential variant lineages in each country was weakly correlated with the total number of COVID-19 cases (rho = + 0.364; p < 0.0001) and with the number of shared genome sequences in each country (rho = + 0.338; p < 0.0001). However, these correlations were weakened by the large number of countries in which no consequential variant was first detected. To further investigate the impact of the number of overall COVID-19 cases and shared genome sequences on the number of confirmed consequential variant lineages in each country, a scatterplot was constructed using these variables (Fig. 3). Each point in the figure represents a single country, with its size proportional to the number of consequential lineages originating in that country. This distribution implied that the number of overall COVID-19 cases would contribute more to the detection of consequential variants than the number of shared viral genome sequences.
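For readers who want to reproduce this kind of rank correlation, a minimal sketch follows. It assumes that the reported rho is Spearman's rank correlation coefficient, which the text does not state explicitly, and it uses made-up numbers rather than the study data. The helper names `average_ranks` and `spearman_rho` are hypothetical. Tied values receive averaged ranks, which matters here because many countries share a count of zero first-detected variants.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical per-country values, for illustration only.
total_cases = [48_000_000, 9_800_000, 2_900_000, 150_000, 20_000, 900]
first_detected_variants = [5, 6, 3, 0, 0, 0]
print(spearman_rho(total_cases, first_detected_variants))
```

The p-values reported in the text would require an additional significance test, which this sketch does not attempt.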
Simulations for the random occurrence of consequential variants. Based on the assumption that the emergence of variant lineages, irrespective of consequentiality, occurs randomly among all infected populations worldwide, 100 simulations of the occurrence of consequential variants were performed. The expected number of consequential variants in each of the 10 biogeographic regions, together with the actual detected number in each region, is shown in Fig. 4. The numbers of detected consequential variants in the UK and Africa were suggested to be higher than those expected from the simulation data. To visually confirm the suggested geographical disproportion in the occurrence of consequential variant outbreaks, the geographical distributions of consequential variant outbreaks in the first four simulations are shown in Fig. 5. These simulation data suggest that the emergence of consequential variants may not be random. The simulated distributions, compared with the actual distributions of consequential variants, implied a possible presence of geographical disproportion in the occurrence of consequential variants not only between the 10 biogeographical regions but also between the countries within each region. For example, the geographical distribution of the simulated (a) or observed (b) consequential SARS-CoV-2 variants in each European country is shown in Fig. 6. Of the 4000 consequential variants produced in the 100 simulations, 1069 were from European countries. Among them, only 184 (17.2%) were from the UK. The difference between the simulated data and the actual observed distribution may imply that a threshold in the prevalence of the infection or some biogeographical disproportion may exist for the emergence and spread of consequential SARS-CoV-2 variants in each country. Similar results were obtained for African countries (Fig. 7).
Of the 4000 consequential variants produced in the 100 simulations, 164 were from African countries. Among these, 60 (36.6%) were from South Africa.
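The totals reported above can be converted into per-run expectations with simple arithmetic. The short sketch below assumes that the expected number for a region is the mean count across the 100 runs, a definition the text does not spell out. It uses only the figures quoted in this section.

```python
N_RUNS = 100  # number of simulation runs, 40 variants each (4000 in total)

# Simulated variants summed over all runs, as reported in the text.
simulated_totals = {
    "Europe": 1069,
    "UK": 184,
    "Africa": 164,
    "South Africa": 60,
}

# Assumed definition: expected count per run = total over runs / number of runs.
expected_per_run = {k: v / N_RUNS for k, v in simulated_totals.items()}

# Within-region shares quoted in the text.
uk_share_of_europe = simulated_totals["UK"] / simulated_totals["Europe"]
south_africa_share = simulated_totals["South Africa"] / simulated_totals["Africa"]

for region, value in expected_per_run.items():
    print(f"{region}: {value:.2f} expected variants per run")
print(f"UK share of Europe: {uk_share_of_europe:.1%}")
print(f"South Africa share of Africa: {south_africa_share:.1%}")
```

On this reading, a single run of 40 random variants would place roughly 11 in Europe, fewer than 2 in the UK, and fewer than 2 in all of Africa.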

Discussion
In the present study, the actual geographic distribution of the worldwide occurrence of potentially consequential SARS-CoV-2 variants was compared with distributions from 100 simulations that assumed a random occurrence of consequential variants among overall COVID-19 cases. The strength of the simulation was that it enrolled all countries worldwide, even those with small numbers of COVID-19 cases, so as not to underrepresent the contributions of countries with fewer cases. The simulation was repeated 100 times to obtain reliable data for the expected number of consequential variants first detected in each geographical region or country. The results suggest the presence of a discrepancy between the actual and simulated distributions. Such a geographical disproportion was implied in Europe, the Middle East, and Africa. In the Middle East and in European countries other than the UK, the actual numbers of first detected consequential variants were suggested to be lower than expected based on the random occurrence of consequential variants. Meanwhile, the actual numbers in the UK and Africa were suggested to be higher than expected. A possible explanation for the observed geographic disproportionality is a difference between countries and regions in how frequently genome-wide analysis of SARS-CoV-2 was performed. However, as implied by the data obtained from the Nextstrain Study Group, this possibility seems less likely. The results of the present study imply that the number of overall COVID-19 cases would contribute more to the number of first detected consequential variants than the number of shared viral genome sequences in each country.
In regions where geographical disproportions were suspected, most of the major constituent countries have performed whole-genome sequencing and reported data of adequate quality from the early phase of the pandemic to evaluate the genetic diversity of SARS-CoV-2 in the regions 32-38 . Another possibility is that unevaluated factors that may produce biogeographical disproportion have affected the occurrence and spread of the potentially consequential variant lineages. Conceivable factors include host-side biological and genetic backgrounds (e.g., immunocompromised hosts), lifestyles in the locality, animals as possible natural reservoir hosts of the virus, and other unknown environmental and ecological factors 39-43 . These possibilities seem reasonable, as virus replication and spread depend on host translation machinery 44,45 . Further studies are needed to determine whether such host-side factors with geographical disproportion really exist behind the occurrence of consequential variants.
This study had some limitations. First, the correctness of assuming that the number of occurrences of the variant lineages is proportional to the number of overall COVID-19 cases in the region is uncertain. The geographic distribution of consequential variants could be attributed to multiple factors that were not evaluated in this study, as discussed above. Moreover, the performance levels of diagnostic screening tests or contact tracing may differ significantly between countries, so the reported numbers of overall COVID-19 cases may be underestimated in many countries. Another limitation is that the exact relationship between the frequency of viral genome sequencing and the number of consequential variants first detected in each country is uncertain. These uncontrolled factors should be adequately considered and adjusted for in future studies.

Conclusions
The results of the simulations in the present study suggest that there may be geographical disproportion in the occurrence of consequential SARS-CoV-2 variants between biogeographical regions and countries. This finding may imply that unknown host-side factors exist behind the emergence and spread of new potentially consequential SARS-CoV-2 variants, and that consequential variant outbreaks may not occur completely at random among COVID-19 patients.

Figure 5. The results of the first four of the 100 simulations regarding the geographical distribution of the consequential variant outbreaks are shown. In the simulations, a random occurrence of a consequential variant lineage among all COVID-19 cases was assumed. Each filled dot represents the simulated occurrence of a consequential SARS-CoV-2 variant lineage. Compared with the actual distributions shown in Fig. 2, the actual numbers of detected lineages in the Middle East and in European countries other than the UK were lower than expected from the simulations, whereas the numbers in the UK and South Africa were higher than expected from the simulations. Color maps were created using the MapChart software.

Figure 7. Map of the simulated or observed numbers of consequential SARS-CoV-2 variants first detected in African countries. As in the color maps for Europe, the presence of biogeographical disproportion in the emergence and spread of consequential variants is suggested in African countries. Color maps were created using the MapChart software.